relative prevalence
How Large Language Models Are Changing MOOC Essay Answers: A Comparison of Pre- and Post-LLM Responses
Leppänen, Leo, Aunimo, Lili, Hellas, Arto, Nurminen, Jukka K., Mannila, Linda
The release of ChatGPT in late 2022 caused a flurry of activity and concern in the academic and educational communities. Some see the tool's ability to generate human-like text that passes at least cursory inspections for factual accuracy "often enough" as heralding a golden age of information retrieval and computer-assisted learning. Others worry that the tool may lead to unprecedented levels of academic dishonesty and cheating. In this work, we quantify some of the effects of the emergence of Large Language Models (LLMs) on online education by analyzing a multi-year dataset of student essay responses from a free university-level MOOC on AI ethics. Our dataset includes essays submitted both before and after ChatGPT's release. We find that the launch of ChatGPT coincided with significant changes in both the length and style of student essays, mirroring observations in other contexts such as academic publishing. We also observe -- as expected based on related public discourse -- changes in the prevalence of key content words related to AI and LLMs, but not necessarily in the general themes or topics discussed in the student essays, as identified through (dynamic) topic modeling.
Quantifying disparities in intimate partner violence: a machine learning method to correct for underreporting
Shanmugam, Divya, Hou, Kaihua, Pierson, Emma
Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy which prioritizes groups who are disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without restrictive assumptions commonly made in positive unlabeled learning and even if it is impossible to recover the absolute prevalence. We conduct experiments on synthetic and real health data which demonstrate our method's ability to recover the relative prevalence more accurately than do baselines, and demonstrate the method's robustness to plausible violations of the covariate shift assumption. We conclude by illustrating the applicability of our method to case studies of intimate partner violence and hate speech.
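The identifiability claim in this abstract can be illustrated with a small numerical sketch. It is not the authors' method: it only shows why a ratio of group prevalences can survive when p(disease | symptoms) is known only up to an unknown constant factor, as can happen in positive unlabeled learning. The symptom profiles, group distributions, probabilities, and the scale factor `k` below are all made-up values chosen for illustration.

```python
# Hypothetical illustration: the relative prevalence is unchanged when
# p(y=1 | x) is known only up to an unknown multiplicative constant.

def prevalence(p_y_given_x, p_x_given_g):
    """Group prevalence: sum over x of p(y=1 | x) * p(x | g)."""
    return sum(p * w for p, w in zip(p_y_given_x, p_x_given_g))

# Assumed covariate shift setting: p(y=1 | x) is shared by both groups,
# only the distribution over the three symptom profiles x differs.
p_y_given_x = [0.05, 0.20, 0.60]   # made-up values
p_x_group_a = [0.5, 0.3, 0.2]      # made-up values
p_x_group_b = [0.7, 0.2, 0.1]      # made-up values

true_rel = prevalence(p_y_given_x, p_x_group_a) / prevalence(p_y_given_x, p_x_group_b)

# Suppose underreporting means we can only recover k * p(y=1 | x)
# for some unknown k in (0, 1]; the absolute prevalence is then lost.
k = 0.37                           # made-up, treated as unknown
scaled = [k * p for p in p_y_given_x]
est_rel = prevalence(scaled, p_x_group_a) / prevalence(scaled, p_x_group_b)

print(true_rel, est_rel)
```

The two printed ratios agree because `k` cancels between numerator and denominator, while each scaled group prevalence on its own is off by the factor `k`. In practice the scaled probabilities would have to be estimated from data, which this sketch does not attempt.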
Human natural selection adding to 'nearsightedness epidemic'
Natural selection among humans is adding to the 'epidemic' of nearsightedness, with each successive generation in the UK gaining more than 100,000 extra cases. It is estimated that around half of the world's population -- some 4.9 billion people -- will suffer from the distant visual impairment by the middle of the century. Much of the problem is environmental -- with increased screen time and not enough time spent outdoors using our long-distance vision often blamed. However, US experts have found that many of the genetic variants that increase the risk of nearsightedness, or myopia, are also associated with reproductive benefits. Thus, those with these genes are likely to have more children, and from a younger age, increasing the relative prevalence of myopia-causing genes in the population.